
    Zorro: zero-cost reactive failure recovery in distributed graph processing

    Distributed graph processing frameworks have become increasingly popular for processing large graphs. However, existing frameworks either lack the ability to recover from failures or support proactive recovery methods. Proactive recovery methods like checkpointing incur high overheads during failure-free execution, making failure recovery an expensive operation. Our hypothesis is that reactive recovery of failures in graph processing, which provides a zero-overhead alternative to expensive proactive failure recovery mechanisms, is feasible, novel and useful. We support the hypothesis with Zorro, a recovery protocol that reactively recovers from machine failures. Zorro utilizes vertex replication inherent in existing graph processing frameworks to collectively rebuild the state of failed servers. Surviving servers transfer the states of inherently replicated vertices back to replacement servers, which rebuild their state using the received values. This fast recovery mechanism prioritizes high-degree vertices, ensuring high accuracy of graph processing applications. We have implemented our approach in two existing distributed graph processing frameworks: LFGraph and PowerGraph. Experiments using graph applications on real-world graphs show that Zorro is able to recover between 87% and 92% of the graph state when half the cluster fails and maintains at least 97% accuracy in all experimental failure scenarios.
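
The replica-based rebuild described in the abstract above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Zorro's actual protocol: the function name `recover_failed_state`, the dictionary layout of replicas, and the choice to reset unrecovered vertices to an initial value are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def recover_failed_state(
    failed_vertices: Set[int],
    survivor_replicas: Dict[str, Dict[int, float]],
    initial_value: float = 0.0,
) -> Tuple[Dict[int, float], float]:
    """Rebuild the state of vertices owned by failed servers.

    failed_vertices: ids of vertices whose owning server failed.
    survivor_replicas: for each surviving server (hypothetical names),
        the replicated vertex values it happens to hold.
    Returns the rebuilt state and the fraction of vertices recovered
    from replicas (the rest fall back to initial_value, an assumption).
    """
    rebuilt: Dict[int, float] = {}
    # Each surviving server sends back any replica of a lost vertex.
    for server in sorted(survivor_replicas):
        for vertex, value in survivor_replicas[server].items():
            if vertex in failed_vertices and vertex not in rebuilt:
                rebuilt[vertex] = value
    recovered = len(rebuilt)
    # Vertices with no surviving replica restart from the initial value.
    for vertex in failed_vertices:
        rebuilt.setdefault(vertex, initial_value)
    fraction = recovered / len(failed_vertices) if failed_vertices else 1.0
    return rebuilt, fraction


if __name__ == "__main__":
    state, fraction = recover_failed_state(
        failed_vertices={1, 2, 3},
        survivor_replicas={"s1": {1: 0.5, 9: 0.1}, "s2": {2: 0.7, 1: 0.5}},
    )
    print(state, fraction)
```

In this model a vertex with many neighbors on other servers is more likely to have a surviving replica, which is one plausible reading of why high-degree vertices are recovered preferentially; no work is done until a failure actually occurs.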

    Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing

    Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) in exchange for reducing the overhead during failure-free execution to zero. We build a system called Zorro that embodies this reactive approach, and integrate Zorro into two graph processing systems: PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication (inherent in today’s graph processing systems) to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when a few servers fail, and between 87% and 92% when half the cluster fails. Furthermore, using eight common graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios.